Corpus Based Method of Transforming Nominalized Phrases into Clauses for Text Mining Application
Author
Abstract
Nominalization is a linguistic phenomenon in which events usually described by clauses are expressed as noun phrases. Extracting event structures is an important task in text mining applications. To achieve this, clauses are parsed and the argument structures of the main verbs are extracted from the parse results. This kind of preprocessing has been common in past research. To extract event structures from nominalized phrases as well, we need a technique for transforming nominalized phrases into clauses. In this paper, we propose a corpus-based method for this transformation. The proposed method first enumerates the possible predicate/argument structures for a nominalized phrase (a noun phrase) and then ranks them by the frequency of each argument in the corpus. The algorithm based on this method was evaluated on a corpus of 24,626 aviation safety reports in English and achieved 78% accuracy in transformation. The algorithm was also evaluated within a text mining application that extracts events and their cause-effect relations from the texts, where it improved the application's performance.
Key words: nominalization, predicate/argument structure, text mining, corpus-based method
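The enumerate-then-rank idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual enumeration rules or scoring function, so everything below is an assumption made for illustration: the lexicon `NOMINALIZATION_TO_VERB`, the toy counts in `TRIPLE_COUNTS`, the two-role candidate set, and the function names are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical lexicon mapping nominalizations to base verbs (not from the paper).
NOMINALIZATION_TO_VERB = {"failure": "fail", "inspection": "inspect"}

# Toy counts of (verb, role, argument) triples, standing in for counts gathered
# from parsed clauses in a corpus. The numbers are invented for illustration.
TRIPLE_COUNTS = Counter({
    ("fail", "subject", "engine"): 12,
    ("fail", "object", "engine"): 1,
    ("inspect", "object", "aircraft"): 9,
    ("inspect", "subject", "mechanic"): 7,
})


def enumerate_candidates(modifier, head):
    """List candidate predicate/argument structures for a 'modifier head' phrase."""
    verb = NOMINALIZATION_TO_VERB.get(head)
    if verb is None:
        return []  # head noun is not a known nominalization
    # Assumption: the modifier may fill either the subject or the object role.
    return [(verb, role, modifier) for role in ("subject", "object")]


def rank_candidates(modifier, head, counts=TRIPLE_COUNTS):
    """Order candidates by how often the argument fills that role in the counts."""
    candidates = enumerate_candidates(modifier, head)
    # Counter returns 0 for unseen triples, so unattested readings sort last.
    return sorted(candidates, key=lambda c: counts[c], reverse=True)


def to_clause(candidate):
    """Render a candidate as a crude clause skeleton (no inflection)."""
    verb, role, arg = candidate
    return f"{arg} {verb}" if role == "subject" else f"(someone) {verb} {arg}"


if __name__ == "__main__":
    for modifier, head in [("engine", "failure"), ("aircraft", "inspection")]:
        best = rank_candidates(modifier, head)[0]
        print(f"{modifier} {head} -> {to_clause(best)}")
```

With these toy counts, "engine failure" prefers the subject reading and "aircraft inspection" prefers the object reading. A real system would take its counts from a parsed corpus and would need to handle prepositional arguments such as "inspection of the aircraft", which this sketch ignores.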
Similar resources
Annotation of Predicate-argument Structure on Molecular Biology Text
Annotated corpora are essential resources for natural language processing. This paper describes our approach for building a corpus annotated with predicate-argument structure on research abstracts in the molecular biology domain. Observation of the records in a database of cell signaling events and corresponding research abstracts showed that extracting predicate-argument structure is a useful interm...
Using Loglinear Clustering for Subcategorization Identification
In this paper we will describe a process for mining syntactical verbal subcategorization, i.e. the information about the kind of phrases or clauses a verb goes with. We will use a large text corpus having almost 10,000,000 tagged words as our resource material. Loglinear modeling is used to analyze and automatically identify the subcategorization dependencies. An unsupervised clustering algorit...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Extracting Syntax Statistics from Large Corpora of Written English
The field of linguistics has seen a growing interest in the statistics of everyday language. In studying how we acquire language and why some of its aspects are more difficult for us than others, it is critical to understand the linguistic environment to which we are exposed. However, gathering statistics over syntactic structures, even with a syntactically tagged corpus, can be difficult and t...
Text mining for systems biology and MetNet
Chapter 1. Background: text mining of biological literature for interaction extraction. 1.1 Review of interaction extraction methods ...